- 4. Changes and Additions
-
- 4.1 Changes and Additions in MineSet 2.5
-
- 4.1.1 General Changes
-
- MineSet 2.5 introduces several new
- features that are described in this chapter, including
- parallelization, clustering, regression, and decision
- tables.
-
- In addition, MineSet 2.5 includes:
-
- +o Boosting classifiers: A general algorithm that can be
- applied to inducers and that often increases their
- accuracy.
-
- +o Display training set as disks: When backfitting is
- enabled while a decision tree, option tree, or regression
- tree is built, it is now possible to see the
- relationship between the distribution of the training
- set at each of the nodes in the tree and the complete
- data set (training set + test set). The bars in the
- treeviz display will still represent the total backfit
- distribution, but with 'display training set as disks'
- enabled, the relative distribution of the training set
- will also be displayed, as treeviz disks over the bars.
- This allows you to visually verify, if you wish, the
- validity of the random selection of the training set
- out of the complete set.
-
- +o Evidence Visualizer features: The ability to have
- multiple selections, choices between conditional
- probability cakes or probability pies, loss matrices,
- a Laplace correction toggle, a toggle to display nulls,
- and an alternative landscape viewer, among other new
- features.
-
- +o Tool Manager: Aside from changes to support the new
- analytical tools, the only significant new features for
- Tool Manager in release 2.5 are found on the "Edit
- History" panel. "Edit History" used to bring up a
- separate dialog in which the data transformations were
- shown as a graph. In version 2.5, the button has been
- changed to "View History." When you click this button,
- the transformation graph appears in the same window
- instead of a dialog box, and the menu options are still
- available at the top of the window. Click the "View
- Op-At-A-Time" button to return to the standard, one
- operation at a time view. The usability of the "View
- History" mode has been improved, and a new "View Data"
- button has been added to allow you to see the data (or
- a sample of the data) at any point in the history.
-
- +o Web Extensions: All the new tools can now be launched
- from the web. MineSet Web Extensions now fully
- supports the creation and visualization of the new and
- existing MineSet visualizer files.
-
- +o Parallelization: Discretization and tree induction
- algorithms (decision, option, regression trees) have
- been parallelized. Discretization is parallelized with
- respect to the number of attributes to discretize.
- Tree induction algorithms have been parallelized
- topologically: computations in different branches of
- the tree run in parallel.
-
- The MineSet User's Guide refers to the generic mining
- application engine "MIndUtil," although in reality
- there exist two separate applications: the parallel
- MIndUtil_p and the single-threaded MIndUtil_s.
- Pragmatically speaking, the parallel MIndUtil_p should
- only be run on multiprocessor systems with an IRIX
- release that supports multi-threaded applications:
- IRIX 6.2 (with patches), 6.4 (with patches), or 6.5.
- Otherwise, the single-threaded MIndUtil_s is the most
- efficient choice.
-
- Since the O2 is a uniprocessor machine, parallelization
- is not available in the IRIX 6.3 version of MineSet.
-
- 4.1.2 Changes and Additions to the Statistics Visualizer
-
- +o The Statistics Visualizer is now a separate
- application, in contrast to MineSet 2.0/2.0.1, where
- it existed as a semiautonomous subwindow of the Tool
- Manager.
-
- 4.2 Changes and Additions in MineSet 2.0.1
-
- 4.2.1 General Changes
-
- +o The Associations Rule Generator now accepts input from
- flat files as well as databases. The Tool Manager
- interface for Associations has been changed to support
- this and to simplify the process of working with
- Associations. Use of "assoccvt" for the creation of
- "assoc" binary files now occurs automatically and
- invisibly, thus the buttons for creation and selection
- of these binary files have been removed. N.B. If you
- wish to run the Rule Visualizer without running
- Associations, you can do so using the Tool Manager's
- "Visual Tools" menu.
-
- +o Client speed for reading MineSet binary files is
- considerably faster than in version 2.0.
-
- +o MineSet 2.0.1 complies with the X/Open guidelines for
- dates past the year 2000. Previous versions of MineSet
- already used four-digit year fields for ASCII output,
- and an internal date/time format that handles dates
- well beyond 2000. The only change from previous
- versions is that when MineSet reads externally
- prepared ASCII files in which dates use a two-digit year
- format, year fields are interpreted with 00-68 being
- 2000-2068 and 69-99 being 1969-1999.
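-
- The two-digit year rule described above can be written out
- as a short sketch. The function name expand_two_digit_year
- is an assumption made for illustration; it is not a MineSet
- routine.

```python
def expand_two_digit_year(yy: int) -> int:
    """Map a two-digit year to a four-digit year using the 00-68 / 69-99 rule."""
    if not 0 <= yy <= 99:
        raise ValueError(f"expected a two-digit year, got {yy}")
    # 00-68 are read as 2000-2068; 69-99 are read as 1969-1999.
    return 2000 + yy if yy <= 68 else 1900 + yy
```

- Under this rule a field of 68 is read as 2068, while 69 is
- read as 1969.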
-
- 4.3 Changes and Additions in MineSet 2.0
-
- 4.3.1 General Changes
-
- +o All the visual tools except for the Rule Visualizer and
- Evidence Visualizer support multiple selection,
- allowing selection of multiple objects in the scene.
- The data associated with all selected objects may be
- viewed by choosing Selections/Show Values from the
- tool's menu. For most visual tools, multiple selection
- is accomplished using Shift-Left mouse click. (In the
- Splat Visualizer it is accomplished by drawing a box
- around the selections.)
-
- +o All the visual tools except for the Rule Visualizer
- support "Drill Through". This allows you to select one
- or more objects, and send a request to the Tool Manager
- to fetch the original data. There are two options.
- Selections/Show Original Data tells Tool Manager to
- bring up a table of the original data that resulted in
- the selections, while Selections/Send to Tool Manager
- tells the Tool Manager to insert a filter operation,
- allowing the user to launch other visualizations or
- mining tools on the selected data.
-
- +o A new tool, the Splat Visualizer (splatviz), aggregates
- large amounts of data, and displays it using
- transparent graphical objects (splats). Using this
- tool one can interactively view data which has very
- many records.
-
- +o A Statistics Visualizer displays basic statistics of
- the data, including mean, standard deviation,
- quartiles, number of values, and histograms. The
- Statistics Visualizer is built into the Tool Manager.
-
- +o A Record Viewer replaces the Text Editor for viewing
- MineSet data files. This displays the data in tabular
- form.
-
- +o MineSet data files now default to a more compact,
- faster-to-read binary format. The ASCII format is
- still supported and may be specified via the Tool
- Manager Preferences panel.
-
- +o The visual tools can save and print images of
- themselves. (However, in Release 2.0/2.0.1, due to a
- limitation in the implementation, this functionality is
- only available when displaying on a Silicon Graphics
- workstation. See the Known Problems and Workarounds
- section for more details.)
-
- +o The visual tools' Animation Panel has three new buttons
- below the VCR-like buttons which control the play mode:
- Play-Once, Loop, and Swing. In the default Play-Once
- mode, the animation follows the drawn path from
- beginning to end (or end to beginning, for Play
- Reverse) and stops. In Loop mode, the animation
- follows the drawn path from beginning to end (or end to
- beginning), then seamlessly and indefinitely repeats.
- In Swing mode, the animation follows the drawn path
- from beginning to end, then backward from the end to
- the beginning, then again from beginning to end, ad
- infinitum.
-
- +o All configuration files now include a version number
- "MineSet 2.0" as the first line.
-
- +o A symbolic link was added so that /usr/lib/mineset can
- be used in place of /usr/lib/MineSet.
-
- +o Several of the images have been moved from
- MineSet_common to MineSet.
-
- +o The utilities mineset2sas and sas2mineset have been
- added for converting files between MineSet and SAS
- format.
-
- +o Setting the environment variable MINESET_WARN_EXECUTE
- will have the same effect as launching all visual tools
- with the -warnexecute option, and will cause the visual
- tools to issue a warning before executing a
- user-specified command.
-
- +o A -quiet option has been added to the visual tools. If
- this option is specified, the tools will not pop up
- dialogs when they are busy. This can be turned on
- permanently by adding the line
- *minesetQuiet:TRUE
- to your .Xdefaults file.
-
- +o For users familiar with Inventor, it is possible to turn on the
- Inventor menu by adding the X resource
- *minesetInventorMenu:True
- to your .Xdefaults file.
-
- +o For scatterviz and splatviz, automatic spinning may be enabled using
- the following X resources:
- Scatterviz*SoXtExaminerViewer.spinAnimation: on
- or
- Splatviz*SoXtExaminerViewer.spinAnimation: on
-
- 4.3.2 Changes and Additions to the Tree Visualizer
-
- +o Because Shift Left mouse is now used for multiple
- selection, you must use the Control key to indicate
- that a zoom is not to take place.
-
- +o When a bar is selected, the zooming will take place to
- view the complete base on which the bar rests rather
- than only the individual bar. Clicking on any bar on a
- given base will zoom to the same location as clicking on
- the base itself.
-
- +o The Filter Panel now contains filtering criteria
- similar to the Search Panel, but it filters out the
- nodes that don't match rather than highlighting those
- that do.
-
- +o In the Main window, clicking Mouse button 3 can bring
- up a menu to select the children of a node. If you
- click on a node with children, it will give you a list
- of the children of that node. If you do not click on a
- node, but a node is selected, it will give you a list
- of children of the selected node. If nothing is
- selected, or if the selected node has no children, no
- menu will be displayed.
-
- +o New external Control buttons have been added to move to
- the sibling to the left or right of the current
- selection, to move to the first or last child of the
- current selection, or to provide a list of children of
- the current selection. These have also been added to
- the Go menu except for the list of children.
-
- +o The distinction between scale and max has been
- eliminated in the configuration file. Scale is now the
- recommended option, and can be used wherever max was
- previously required. For compatibility, max can also
- be used wherever scale can be used.
-
- +o The execute statement can now be specified via the tool
- options in the Tool Manager.
-
- +o The Search Panel now has a Select button which will
- select everything that matched the previous search.
-
- 4.3.3 Changes and Additions to the Scatter Visualizer
-
- +o The Scatter Visualizer now supports an execute
- statement similar to the Tree and Map Visualizers.
- This can be specified in the Tool Manager or edited
- directly into the configuration file.
-
- +o The Filter Panel has been moved from the Filter menu to
- the View Menu. Set Landscape to Filter has been
- renamed Scale to filter, moved into the Filter Panel,
- and defaults to on.
-
- 4.3.4 Changes and Additions to the Map Visualizer
-
- +o The execute statement, the "map outlines" geo hierarchy
- file, and the "color normalize" statement can now be
- specified via the tool options in the Tool Manager.
-
- +o The View menu now supports a Filter Panel.
-
- +o The Selections menu supports the customary options seen
- in the other tools (Show Values, Show Original Data,
- Send To Tool Manager, and Complementary Drill Through),
- and in addition supports Select All (all the objects in
- the scene become selected).
-
- 4.3.5 Changes and Additions to the Data Mover
-
- +o The Data Mover no longer uses the Oracle-provided library,
- libclnsh.so, to connect to Oracle databases. Because
- of this, there is no longer a need for a local Oracle
- installation when MineSet is to access a remote Oracle
- database.
-
- +o The Data Mover now reads and writes files in the
- MineSet binary file format in addition to the ASCII
- format.
-
- +o Filtering, i.e., allowing only records satisfying a
- specified condition to pass, is now supported as a
- streaming operation.
-
- +o Random sampling of records is now supported as a
- streaming operation. This comes in two forms, one in
- which the user specifies a desired resulting sample
- size, and one in which the user specifies an
- approximate percentage of records to include in the
- sample (accept records with probability p).
-
- +o Data Mover now accumulates basic statistical
- information on a data source. The resulting data is
- used to support the Statistics Visualizer.
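-
- The two streaming sampling forms described above can be
- sketched as follows. These notes do not say which algorithms
- the Data Mover uses; the sketch assumes reservoir sampling
- for the fixed-size form and independent acceptance with
- probability p for the percentage form. Both function names
- are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def sample_by_probability(records: Iterable[T], p: float,
                          rng: random.Random) -> Iterator[T]:
    """Pass each record through independently with probability p."""
    for record in records:
        if rng.random() < p:
            yield record


def sample_fixed_size(records: Iterable[T], k: int,
                      rng: random.Random) -> List[T]:
    """Keep a uniform random sample of up to k records in one pass
    (reservoir sampling, an assumed technique)."""
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)   # fill the reservoir first
        else:
            j = rng.randint(0, i)      # uniform over all records seen so far
            if j < k:
                reservoir[j] = record  # replace a random slot
    return reservoir
```

- Both functions read the input once and never hold more than
- the sample itself in memory, which is what makes them usable
- as streaming operations. The percentage form yields only an
- approximate sample size, as the text notes.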
-
- 4.3.6 Changes and Additions to the Analytical Mining Tools
-
-
- +o An Option Tree Inducer and Classifier have been added
- to the set of inducers available under the Mining Tools
- Classify tab.
-
- +o The classifiers and inducers have been extended to work
- with record weights.
-
- +o The classifiers and inducers can now utilize a
- user-specified loss matrix that indicates the loss (or cost)
- associated with various types of classification errors.
-
- +o Generating a learning curve has been added as a new
- classifier mode. A learning curve assesses how the
- classifier's error rate is affected by the number of
- training records.
-
- +o Accuracy estimation has been changed to error
- estimation. The Estimate error mode now generates a
- model from the whole dataset in addition to estimating
- the error using cross validation.
-
- +o Decision Trees and Option Trees now show the estimated
- error for every node, allowing users to better
- understand where the model is more accurate and where
- it is not. This estimate is now mapped to color,
- replacing the purity mapping used in MineSet 1.X.
-
- +o The inducers now generate classifiers that are capable
- of estimating probabilities (scoring), not just
- classifying records. This option is available through
- the apply-classifier transformation.
-
- +o Lift curves, showing the effectiveness of the
- probability estimates, can be generated from Further
- inducer options and under Apply Classifier's test
- classifier. Lift curves show how effectively a
- classifier can distinguish a specified label value from
- all other label values.
-
- +o Confusion matrices, showing the specific types of
- errors that the classifier makes, can be generated from
- Further inducer options and under Apply Classifier's
- test classifier.
-
- +o It is now possible to backfit the test data into the
- classifier after estimating the classifier's accuracy.
- This mode is on by default and can be modified in
- Further inducer options. It allows users to see the
- actual record counts/weights, rather than only those
- that appeared in the training set. Fitting the test
- data into a classifier updates the probability
- estimates without altering the structure of the
- classifier. Backfitting can reduce the error rate.
-
- +o The apply classifier options have been extended to
- allow testing a classifier against a test set and
- fitting new data to previously created classifiers.
- Fitting new data can be useful if large amounts of data
- are available: a model can be built using a sample and
- the bigger dataset can be used to update the model
- counts and probability estimates.
-
- +o The Laplace correction for the Evidence Inducer now
- supports an automatic correction that has been
- empirically determined to be more accurate in many
- real-world datasets.
-
- +o The Automatic column selection in the Evidence Inducer
- now supports a faster "forward" mode.
-
- +o Uniform Weight has been added to the set of automatic
- binning approaches. Under uniform weight binning,
- thresholds are identified that partition the records
- into subsets of equal weight.
-
- +o It is now possible to trim a specified percent of the
- most extreme values prior to generating uniform range
- or uniform weight bins.
-
- +o The binning panel now supports using the training set
- only, weighted records, and automatic determination of
- weight per bin.
-
- +o Automatic binning time (entropy-based) has been reduced
- by a factor of about 15-20. This dramatically reduces
- the running time for the Evidence Inducer or when the
- automatic binning is used in the binning panel.
-
- +o Reading time (initial loading of data passed by
- datamove) has been reduced by about 20-25%.
-
- +o Classification models now require only the actual
- attributes that are used in order to apply them to new
- data. Specifically, if a decision tree uses only three
- attributes, only those will be required to apply it.
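-
- As an illustration of the uniform weight binning mentioned
- in the list above, the following sketch picks thresholds so
- that each bin holds roughly the same total record weight.
- It is an assumption about how such thresholds could be
- computed, not MineSet's implementation; the function name
- and the choice of midpoints as thresholds are hypothetical.

```python
from typing import List, Sequence


def uniform_weight_thresholds(values: Sequence[float],
                              weights: Sequence[float],
                              n_bins: int) -> List[float]:
    """Return up to n_bins - 1 thresholds splitting records into bins of
    approximately equal total weight."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    pairs = sorted(zip(values, weights))
    total = float(sum(w for _, w in pairs))
    thresholds: List[float] = []
    cumulative = 0.0
    # Walk adjacent records in sorted order, accumulating weight.
    for (value, weight), (next_value, _) in zip(pairs, pairs[1:]):
        cumulative += weight
        target = total * (len(thresholds) + 1) / n_bins
        # Cut only between distinct values once the next weight target is met.
        if (len(thresholds) < n_bins - 1 and cumulative >= target
                and next_value > value):
            thresholds.append((value + next_value) / 2)
    return thresholds
```

- With values 1, 2, 3, 4 and equal weights, two bins give a
- single threshold at 2.5; if the first record carries weight
- 3 and the rest weight 1, the threshold moves to 1.5. A
- single very heavy record can span more than one target, in
- which case fewer than n_bins bins result.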